General introduction to data visualization
Introduction to ggplot2 in R
Practice with ggplot2
Introduction to other graphics packages in R
10/05/2019
Exploratory data analysis
Explore pattern, trend, and distribution
Find correlation between variables
Regression analysis
Statistical analysis
Report your results
Communicate with non-statisticians
Share findings
Show fancy plots to your audiences
One variable: Histogram, Bar chart, Density plot…
Two variables: Scatter plot, Box plot, Violin Plot…
Multiple variables: Heatmap…
Think of your data and variables carefully, and choose the most appropriate statistical plot.
A plot summarizes statistics better than tables and text.
It makes a trend or a pattern in the data easy to see.
It is a more engaging way to catch your audience's eye.
For fun…
##             TEAM    SEASON  WIN.   PTS OFFRTG DEFRTG   PACE REGION ABV
## 1  Atlanta Hawks 2015-2016 0.585 102.8  104.6  100.8  97.63   East ATL
## 2  Atlanta Hawks 2018-2019 0.354 113.3  107.5  113.1 104.56   East ATL
## 3  Atlanta Hawks 2017-2018 0.293 103.4  104.4  110.1  98.76   East ATL
## 4  Atlanta Hawks 2016-2017 0.524 103.2  104.5  105.2  97.76   East ATL
## 5 Boston Celtics 2015-2016 0.585 105.7  105.8  102.5  99.43   East BOS
## 6 Boston Celtics 2016-2017 0.646 108.0  110.6  108.0  97.21   East BOS
WIN.: Winning rate, which is the percentage of games played that a team has won.
PTS: The number of points scored.
OFFRTG: Offensive Rating, which measures a team’s points scored per 100 possessions.
DEFRTG: Defensive Rating, which is the number of points allowed per 100 possessions by a team.
PACE: Pace, which is the number of possessions per 48 minutes for a team.
REGION: East/West.
ABV: The abbreviation of a team.
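The slides do not show how nba.data is loaded. As a minimal stand-in, the sketch below rebuilds only the six rows printed above, so the later examples can be tried on a tiny subset. Treating SEASON and REGION as factors is an assumption, based on the later calls to levels() on those columns.

```r
# Stand-in for nba.data: only the six rows shown in the head() output above.
# Assumption: SEASON and REGION are factors (the slides call levels() on them).
nba.data <- data.frame(
  TEAM   = c(rep("Atlanta Hawks", 4), rep("Boston Celtics", 2)),
  SEASON = factor(c("2015-2016", "2018-2019", "2017-2018", "2016-2017",
                    "2015-2016", "2016-2017")),
  WIN.   = c(0.585, 0.354, 0.293, 0.524, 0.585, 0.646),
  PTS    = c(102.8, 113.3, 103.4, 103.2, 105.7, 108.0),
  OFFRTG = c(104.6, 107.5, 104.4, 104.5, 105.8, 110.6),
  DEFRTG = c(100.8, 113.1, 110.1, 105.2, 102.5, 108.0),
  PACE   = c(97.63, 104.56, 98.76, 97.76, 99.43, 97.21),
  REGION = factor(rep("East", 6), levels = c("East", "West")),
  ABV    = c(rep("ATL", 4), rep("BOS", 2))
)

# Inspect the column types before plotting
str(nba.data)
```

With only six rows and no West teams, plots drawn from this subset will look much sparser than the ones on the slides.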
A poorly made plot can:
be hard to read if labels and legends are not clear
confuse people if it is not well designed
deliver misleading information (sometimes on purpose)
graphics: The base R graphics package
ggplot2: The grammar of graphics
plotly: Interactive plots, for example in R Shiny apps
leaflet: Interactive maps
graphics package in R
plot(OFFRTG ~ DEFRTG, data = nba.data); plot(WIN. ~ SEASON, data = nba.data)
Functions to create complete plots:
plot(), boxplot(), hist()…
Functions to add elements to an existing plot:
points(), lines(), legend()…
Grammar of graphics
Both quick and complex plots are easy to make
Nice aesthetic settings
Great documentation and tons of online instructions
The histogram of winning rate in different regular NBA seasons and regions:
ggplot(data = nba.data, aes(x = WIN.)) + geom_histogram(binwidth = 0.1, color = "black") + facet_grid(REGION ~ SEASON)
The same histograms with the graphics package:
par(mfrow = c(2, 4), mar = c(2, 2, 3, 1))
# One histogram per REGION x SEASON combination (both must be factors)
for(i in levels(nba.data$REGION)){
  for(j in levels(nba.data$SEASON)){
    subdata <- subset(nba.data, REGION == i & SEASON == j)
    hist(subdata$WIN., breaks = seq(0, 1, 0.1),
         main = paste(i, j, sep = ", "))
  }
}
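For comparison with the layered approach that follows, here is a small sketch of the base graphics workflow: one high-level call creates the plot, then points(), lines() and legend() add elements to it. The data frame nba.small is a hypothetical stand-in that copies three columns from the six rows printed earlier. The colours and plotting symbols are arbitrary choices.

```r
# Hypothetical stand-in data: six rows copied from the head(nba.data) output
nba.small <- data.frame(
  OFFRTG = c(104.6, 107.5, 104.4, 104.5, 105.8, 110.6),
  WIN.   = c(0.585, 0.354, 0.293, 0.524, 0.585, 0.646),
  ABV    = c(rep("ATL", 4), rep("BOS", 2))
)
atl <- subset(nba.small, ABV == "ATL")
bos <- subset(nba.small, ABV == "BOS")

# High-level call: creates the complete plot (axes, labels, first set of points)
plot(WIN. ~ OFFRTG, data = atl, pch = 16, col = "red",
     xlim = range(nba.small$OFFRTG), ylim = c(0, 1),
     xlab = "OFFRTG", ylab = "WIN.")

# Low-level calls: each one adds an element to the existing plot
points(WIN. ~ OFFRTG, data = bos, pch = 17, col = "darkgreen")
fit <- lm(WIN. ~ OFFRTG, data = nba.small)
ord <- order(nba.small$OFFRTG)
lines(nba.small$OFFRTG[ord], fitted(fit)[ord], lty = 2)
legend("bottomright", legend = c("ATL", "BOS"),
       pch = c(16, 17), col = c("red", "darkgreen"))
```

Each added element needs its own call and its own bookkeeping (colours, symbols, legend entries), which is the part ggplot2 automates.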
Idea: a graph is a combination of independent building blocks.
Data that you want to visualise and a set of aesthetic mappings describing how variables in the data are mapped to aesthetic attributes.
Layers made up of geometric elements and statistical transformation. Geometric objects, geoms for short, such as points, lines, polygons, etc. Statistical transformations, stats for short, summarise data in many useful ways.
The scales map values in the data space to values in an aesthetic space, whether it be colour, or size, or shape.
A coordinate system, coord for short, describes how data coordinates are mapped to the plane of the graphic.
A facet describes how to break up the data into subsets and how to display those subsets as small multiples.
A theme which controls the finer points of display, like the font size and background colour.
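As a rough illustration of how these building blocks line up with ggplot2 code, the sketch below uses one function per component. The data frame nba.small is a hypothetical stand-in copying a few columns from the six rows printed earlier. Faceting by ABV is chosen only because this tiny subset contains two teams.

```r
library(ggplot2)

# Hypothetical stand-in data: six rows copied from the head(nba.data) output
nba.small <- data.frame(
  OFFRTG = c(104.6, 107.5, 104.4, 104.5, 105.8, 110.6),
  WIN.   = c(0.585, 0.354, 0.293, 0.524, 0.585, 0.646),
  ABV    = c(rep("ATL", 4), rep("BOS", 2))
)

p <- ggplot(nba.small, aes(x = OFFRTG, y = WIN.)) +  # data + aesthetic mappings
  geom_point() +                          # layer: point geom, identity stat
  scale_y_continuous(limits = c(0, 1)) +  # scale: y values shown on a 0-1 axis
  coord_cartesian() +                     # coordinate system (the default one)
  facet_wrap(~ABV) +                      # facet: one panel per team
  theme_bw()                              # theme: non-data appearance
p
```

Any of the lines after geom_point() can be dropped and ggplot2 falls back to a default, which is why short calls such as p + geom_point() still produce a complete plot.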
ggplot() is always the first call when building a plot with ggplot2.
We can specify the data set and the aesthetic mappings inside ggplot().
p <- ggplot(data = nba.data, aes(x = OFFRTG, y = WIN.))
p  # draws only the axes, because no geom has been added yet
Map the variables in the data to the components in the plot
x: x axis
y: y axis
color: color of the boundary of a symbol
fill: color of the inside of a symbol
shape: shape of points, solid point, circle, triangle…
size: size of points
linetype: type of lines, solid line, dashed line…
…
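One point that often trips people up is the difference between mapping an aesthetic to a variable and setting it to a constant. The sketch below shows both. The data frame nba.small is a hypothetical stand-in copying three columns from the six rows printed earlier.

```r
library(ggplot2)

# Hypothetical stand-in data: six rows copied from the head(nba.data) output
nba.small <- data.frame(
  OFFRTG = c(104.6, 107.5, 104.4, 104.5, 105.8, 110.6),
  WIN.   = c(0.585, 0.354, 0.293, 0.524, 0.585, 0.646),
  ABV    = c(rep("ATL", 4), rep("BOS", 2))
)

# Mapped: color depends on a variable, so it goes inside aes().
# ggplot2 picks one color per team and draws a legend.
p.mapped <- ggplot(nba.small, aes(x = OFFRTG, y = WIN., color = ABV)) +
  geom_point()

# Set: a constant goes outside aes(), as a plain argument of the geom.
# Every point is blue and no legend is drawn.
p.set <- ggplot(nba.small, aes(x = OFFRTG, y = WIN.)) +
  geom_point(color = "blue", size = 3)

p.mapped
p.set
```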
Geometries are the actual graphical elements displayed in a plot. They draw the variables mapped in aes().
We use + to add geometries to a plot
p + geom_point()
We can also specify data and aes inside a geom function. They don’t have to be the same as those in ggplot().
ggplot() + geom_point(data = nba.data, aes(x = DEFRTG, y = WIN.))
geom functions for one variable:
p <- ggplot(data = nba.data, aes(x = WIN.))
p + geom_histogram(binwidth = 0.1)
p + geom_density()
geom functions for two continuous variables:
p <- ggplot(data = nba.data, aes(x = OFFRTG, y = WIN.))
p + geom_point(); p + geom_line(); p + geom_density_2d(); p + geom_smooth(method = "lm")
geom functions for a discrete x and a continuous y:
p <- ggplot(data = nba.data, aes(x = SEASON, y = WIN.))
p + geom_boxplot()
p + geom_violin()
geom function for text labels:
p <- ggplot(data = nba.data, aes(x = OFFRTG, y = WIN.)) + facet_wrap(~SEASON)
p + geom_text(aes(label = ABV), size = 2)
Multiple geom layers:
# after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.3.0)
ggplot(data = nba.data, aes(x = WIN.)) + geom_histogram(aes(y = after_stat(density)), binwidth = 0.1, color = "black") + geom_density()
ggplot(data = nba.data, aes(x = OFFRTG, y = WIN.)) + geom_point() + geom_smooth(method = "lm")
ggplot(data = nba.data, aes(x = SEASON, y = WIN.)) + geom_violin() + geom_boxplot(width = 0.2)
The order of geom functions is important:
# the violin is drawn last here, so it covers the boxplot
ggplot(data = nba.data, aes(x = SEASON, y = WIN.)) + geom_boxplot(width = 0.2) + geom_violin()
Sometimes we need to transform the data set first, so that each quantity we want to plot is a column that an aesthetic can map to.
For instance, if we want to compare the mean of winning rate between seasons and regions…
mean.win <- aggregate(WIN. ~ SEASON + REGION, FUN = mean, data = nba.data)
head(mean.win)
##      SEASON REGION      WIN.
## 1 2015-2016   East 0.4942000
## 2 2016-2017   East 0.4828667
## 3 2017-2018   East 0.4903333
## 4 2018-2019   East 0.4780667
## 5 2015-2016   West 0.5056000
## 6 2016-2017   West 0.5170667
Then we can generate a plot to compare the mean of winning rate based on the new data set.
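The plotting code for this step is not shown in the slides. One possible version, offered as a sketch, is a grouped bar chart built with geom_col(). The data frame below copies only the six rows printed by head(mean.win), so the West bars for the last two seasons are absent here. The choice of a bar chart is an assumption.

```r
library(ggplot2)

# Only the six rows printed by head(mean.win); the full result has more rows
mean.win <- data.frame(
  SEASON = c("2015-2016", "2016-2017", "2017-2018", "2018-2019",
             "2015-2016", "2016-2017"),
  REGION = c(rep("East", 4), rep("West", 2)),
  WIN.   = c(0.4942000, 0.4828667, 0.4903333, 0.4780667,
             0.5056000, 0.5170667)
)

# One bar per SEASON x REGION, placed side by side within each season
p <- ggplot(mean.win, aes(x = SEASON, y = WIN., fill = REGION)) +
  geom_col(position = "dodge")
p
```

geom_col() uses the values in the data as bar heights, which suits data that has already been aggregated. geom_bar() would count rows instead.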
Facet functions make panel plots very easy to build.
facet_wrap wraps a 1d sequence of panels into 2d.
p <- ggplot(data = nba.data, aes(x = OFFRTG, y = WIN.)) + geom_point() + geom_smooth(method = "lm", se = FALSE)
p + facet_wrap(~SEASON)
facet_grid forms a matrix of panels defined by row and column faceting variables.
p <- ggplot(data = nba.data, aes(x = OFFRTG, y = WIN.)) + geom_point() + geom_smooth(method = "lm", se = FALSE)
p + facet_grid(REGION ~ SEASON)
With scales = "free", the axis ranges can vary across rows and columns instead of being shared by every panel.
p <- ggplot(data = nba.data, aes(x = OFFRTG, y = WIN.)) + geom_point() + geom_smooth(method = "lm", se = FALSE)
p + facet_grid(REGION ~ SEASON, scales = "free")
The scale functions control how the plot maps data values to the visual values of an aesthetic, for instance,
scale_x_continuous
scale_y_discrete
scale_color_gradient
scale_fill_manual
You can also specify the labels of axes or legends in the scale functions.
scale_*_continuous: change the scale for continuous variable
scale_*_discrete: change the scale for discrete variable
scale_*_identity: use values without scaling
scale_*_manual: create your own discrete scale
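Of the four families above, scale_*_identity is probably the least familiar, so here is a brief sketch. It applies when a column already holds the visual values themselves, such as colour names. The data frame nba.small and its team.col column are hypothetical and were made up for this illustration.

```r
library(ggplot2)

# Hypothetical stand-in data: six rows copied from the head(nba.data) output
nba.small <- data.frame(
  OFFRTG = c(104.6, 107.5, 104.4, 104.5, 105.8, 110.6),
  WIN.   = c(0.585, 0.354, 0.293, 0.524, 0.585, 0.646),
  ABV    = c(rep("ATL", 4), rep("BOS", 2))
)

# Hypothetical column: a literal colour name for each row
nba.small$team.col <- ifelse(nba.small$ABV == "ATL", "red", "darkgreen")

# scale_color_identity() uses the values in team.col as colours directly,
# instead of treating them as categories to be assigned a palette
p <- ggplot(nba.small, aes(x = OFFRTG, y = WIN., color = team.col)) +
  geom_point(size = 3) +
  scale_color_identity()
p
```

Without scale_color_identity(), ggplot2 would treat "red" and "darkgreen" as two category labels and assign its own default colours to them.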
p <- ggplot(data = nba.data, aes(x = OFFRTG, y = WIN., color = REGION)) + geom_point()
p
p <- ggplot(data = nba.data, aes(x = OFFRTG, y = WIN., color = REGION)) + geom_point()
p + scale_x_continuous(name = "offensive rating", limits = c(97, 116)) +
  scale_y_continuous(name = "winning rate", breaks = seq(0, 1, 0.1)) +
  scale_color_manual(name = "region", labels = c("EAST", "WEST"), values = c("blue", "red"))
p <- ggplot(data = nba.data, aes(x = OFFRTG, y = WIN., color = REGION)) + geom_point()
p + scale_x_reverse()
General purpose scales also work for color and fill.
R color cheatsheet: https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/colorPaletteCheatsheet.pdf
Some scale functions are designed specifically to control the scale of color and fill, for instance:
p <- ggplot(data = nba.data, aes(x = OFFRTG, y = DEFRTG, color = WIN.)) + geom_point()
p + scale_color_gradient(low = "green", high = "red")
General purpose scales also work for size, shape and linetype.
Reference for shape and linetype code: http://www.cookbook-r.com/Graphs/Shapes_and_line_types/
p <- ggplot(data = nba.data, aes(x = OFFRTG, y = DEFRTG, shape = REGION)) + geom_point()
p + scale_shape_discrete("Region", solid = FALSE)
coord_* functions control the transformation of the coordinate system
p <- ggplot(data = nba.data, aes(x = OFFRTG, y = WIN.)) + geom_point() + geom_smooth(method = "lm")
p + coord_fixed(ratio = 20); p + coord_flip(); p + coord_trans(y = "sqrt"); p + coord_polar()
theme_* functions
p <- ggplot(data = nba.data, aes(x = OFFRTG, y = WIN.)) + geom_point() + geom_smooth(method = "lm")
p + theme_bw(); p + theme_classic(); p + theme_grey(); p + theme_minimal()
The labs function sets the title, subtitle, and caption of your plot, as well as the axis and legend labels.
The theme function is a powerful way to customize the non-data components of your plots, i.e. titles, labels, fonts, background, gridlines, and legends. See R help for details.
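As a small illustration of labs() and theme() working together, consider the sketch below. The data frame nba.small is a hypothetical stand-in copying three columns from the six rows printed earlier. The label texts and the particular theme settings are arbitrary choices.

```r
library(ggplot2)

# Hypothetical stand-in data: six rows copied from the head(nba.data) output
nba.small <- data.frame(
  OFFRTG = c(104.6, 107.5, 104.4, 104.5, 105.8, 110.6),
  WIN.   = c(0.585, 0.354, 0.293, 0.524, 0.585, 0.646),
  ABV    = c(rep("ATL", 4), rep("BOS", 2))
)

p <- ggplot(nba.small, aes(x = OFFRTG, y = WIN., color = ABV)) +
  geom_point() +
  # labs(): text shown on the plot
  labs(title = "Winning rate vs. offensive rating",
       subtitle = "Six team-seasons",
       caption = "Subset of the NBA data used in these slides",
       x = "Offensive rating", y = "Winning rate", color = "Team") +
  # theme(): appearance of the non-data elements
  theme(plot.title = element_text(face = "bold", size = 14),
        panel.grid.minor = element_blank(),
        legend.position = "bottom")
p
```

Each theme() argument takes an element_*() object, or element_blank() to remove the element. The legend position is given as a plain string.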
grid.arrange from the gridExtra package can place multiple ggplots on a page
grid.arrange(p1, p2, p3, p4, ncol = 2, nrow = 2)
ggsave can save the plot to your local drive.
ggsave(p, filename = "", height = , width = , units = )
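The two calls above leave p1 to p4 and the ggsave() arguments unspecified. The sketch below fills them in with made-up choices only to give something runnable. The four plots, the file name, and the 6 x 4 inch size are all assumptions, and nba.small is a hypothetical stand-in copying the six rows printed earlier.

```r
library(ggplot2)
library(gridExtra)

# Hypothetical stand-in data: six rows copied from the head(nba.data) output
nba.small <- data.frame(
  OFFRTG = c(104.6, 107.5, 104.4, 104.5, 105.8, 110.6),
  DEFRTG = c(100.8, 113.1, 110.1, 105.2, 102.5, 108.0),
  WIN.   = c(0.585, 0.354, 0.293, 0.524, 0.585, 0.646),
  ABV    = c(rep("ATL", 4), rep("BOS", 2))
)

# Four arbitrary plots to arrange
p1 <- ggplot(nba.small, aes(x = WIN.)) + geom_histogram(binwidth = 0.1)
p2 <- ggplot(nba.small, aes(x = OFFRTG, y = WIN.)) + geom_point()
p3 <- ggplot(nba.small, aes(x = ABV, y = WIN.)) + geom_boxplot()
p4 <- ggplot(nba.small, aes(x = OFFRTG, y = DEFRTG)) + geom_point()

# Draw the plots in a 2 x 2 layout; the arranged object is returned invisibly
g <- grid.arrange(p1, p2, p3, p4, ncol = 2, nrow = 2)

# Save one plot to a temporary PDF; the file extension selects the device
out.file <- file.path(tempdir(), "win_by_offrtg.pdf")
ggsave(filename = out.file, plot = p2, width = 6, height = 4, units = "in")
file.exists(out.file)
```

Naming the plot argument explicitly avoids relying on argument order, since filename is the first argument of ggsave().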
https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
R help is also a great resource.
Wickham, H. (2016). ggplot2: elegant graphics for data analysis. Springer.
GGally: An extension that reduces the complexity of combining geometric objects with transformed data
ggExtra: A package which can add marginal density plots or histograms to ggplot2 scatterplots.
ggrepel: A convenient companion to geom_text() that keeps text labels from overlapping
gganimate: A grammar of animated graphics
more information: http://www.ggplot2-exts.org/gallery/
ggpairs: Make a matrix of plots with a given data set.
ggpairs(data = nba.data, columns = 3:7)
ggcorr: Plot a correlation matrix (heatmap) with ggplot2
ggcorr(data = nba.data[, c(3:7)])
ggMarginal: Create a ggplot2 scatterplot with marginal density plots (default) or histograms, or add the marginal plots to an existing scatterplot.
p <- ggplot(nba.data, aes(x = OFFRTG, y = DEFRTG, color = REGION)) + geom_point() + theme_bw() + theme(legend.position = "bottom")
ggMarginal(p, groupColour = TRUE, groupFill = TRUE)
geom_text_repel can solve the problem of overlapping labels when we plot text on the graph.
ggplot(data = nba.data, aes(x = OFFRTG, y = DEFRTG, size = WIN.)) +
  geom_point(aes(color = REGION), shape = 1) +
  geom_text_repel(data = subset(nba.data, WIN. >= 0.6 | WIN. <= 0.3),
                  aes(label = ABV), size = 1.5, box.padding = 0.3) +
  facet_wrap(~SEASON)
geom_label_repel draws a rectangle underneath the text, making it easier to read.
ggplot(data = nba.data, aes(x = OFFRTG, y = DEFRTG, size = WIN.)) +
  geom_point(aes(color = REGION), shape = 1) +
  geom_label_repel(data = subset(nba.data, WIN. >= 0.6 | WIN. <= 0.3),
                   aes(label = ABV), size = 1.5, box.padding = 0.3) +
  facet_wrap(~SEASON)
ggplot(data = nba.data, aes(x = OFFRTG, y = DEFRTG, size = WIN.)) +
  geom_point(aes(color = REGION), shape = 1) +
  geom_text_repel(aes(label = ABV), size = 1.5, box.padding = 0.3) +
  theme_bw() +
  scale_y_reverse(limits = c(120, 97)) +
  scale_color_manual(values = c("blue3", "red3")) +
  # Here comes the gganimate specific bits
  labs(title = 'SEASON: {closest_state}', x = 'OFFRTG', y = 'DEFRTG') +
  theme(title = element_text(size = 5),
        text = element_text(size = 2)) +
  transition_states(SEASON,
                    transition_length = 2,
                    state_length = 1)
Introduction to data visualization
Grammar of graphics: ggplot2
Practices with basketball game data
Extensions of ggplot2